Causal inference

Neil Lund

2025-01-14

Causation

Goals of Modeling

All models are wrong, but how they’re wrong matters.

Up to this point, we’ve mostly focused on more banal kinds of wrongness:

  • Issues like heteroskedasticity, overdispersion, non-normality, etc. in a regression can distort p-values and confidence intervals.

  • Using a linear model for a non-linear relationship can bias regression coefficients and give us inaccurate or nonsensical predictions.

  • Autocorrelation can make it look like we have a lot of data when we actually have a very small number of independent observations.

Violating these assumptions matters, but we have relatively easy fixes.

Goals of Modeling

All models are wrong, but the importance of that wrongness depends on our goals:

  • If our primary goals are prediction or description, we might care very little about spurious correlation

    • We can use ice cream sales to predict drowning deaths even though the correlation is spurious. (There are certainly better ways to predict drowning deaths, but it will work in a pinch.)
  • If our primary goal is prescription or explanation, then we probably care a lot about spuriousness:

    • Banning ice cream sales definitely won’t reduce drowning deaths because there’s no causal relationship here.

Goals of Modeling

Up to now, we’ve mostly been focusing on those more banal problems, but for the latter part of the course we’ll be talking about the more difficult problem of identifying causal relationships like:

  • do get-out-the-vote campaigns cause people to vote?

  • does a person’s race cause police to treat them differently?

  • do Housing First policies cause a reduction in homelessness?

The fundamental problem of causal inference

  • Causal claims rely on counterfactuals: “if X, then Y” implies “If not X, then not Y”

  • But we never actually observe counterfactuals!

Causation: smoking

Causal claim: Humphrey Bogart got cancer from cigarettes

If Humphrey Bogart had never smoked, he would have been less likely to get cancer

…but we never observe this outcome, and it matters because Bogart did all sorts of other bad stuff that could cause cancer.

Causation: smoking

The more generic version of this claim isn’t necessarily any easier to prove:

  • Causal claim: “if fewer people smoked, all else equal, we would expect to see fewer lung cancer deaths.”

  • We have evidence, but it’s still only a correlation. All sorts of things changed in this time frame.

Figure: “Smoking seemed inevitable until it didn’t.” US cigarette sales (left axis) and lung cancer deaths (right axis) over the 20th century. Smoking was rare at the start of the century and rose to more than 10 cigarettes per adult per day by the 1960s. After the 1964 report of the Surgeon General identified smoking as the major cause of the rise in lung cancer deaths, cigarette sales fell to about a third of their peak and the lung cancer death rate declined.

https://ourworldindata.org/smoking-big-problem-in-brief

Causation: smoking

This should be a familiar problem! We talk about endogeneity or spurious correlation all the time

graph LR
U[Confounder]
X[Treatment]
Y[Outcome]
  U-->X
  U-->Y
  X-->Y

In observational research, we can try to address this with control variables, but those may not be adequate for really complex confounding.

Back to that fundamental thing

Causal claims rest on something we never observe.

  • A more pessimistic view is that this is basically unsolvable.

  • A somewhat more optimistic view is that we can solve this for aggregate probabilistic claims if some very strict assumptions are met.

Potential outcomes framework

Years lived if you smoke:

\[Y_i(1)\]

Years lived if you quit:

\[Y_i(0)\]

Effect of smoking vs. quitting:

\[Y_i(1) - Y_i(0)\]

An ideal study

Subject   Smoker   Quitter   Difference (smoker - quitter)
A         60       71        -11
B         72       70        2
C         72       84        -12
D         71       60        11
E         72       75        -3
F         52       64        -12
G         70       80        -10
Average   67       72        -5

A realistic study

Subject   Smoker   Quitter   Difference (smoker - quitter)
A         ??       71        ??
B         72       ??        ??
C         72       ??        ??
D         71       ??        ??
E         72       ??        ??
F         52       ??        ??
G         ??       80        ??
Average   67.8     75.5      -7.7
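The contrast between the two tables can be checked with a few lines of arithmetic. The sketch below copies the values from the "ideal study" table and treats A and G as the only observed quitters, as in the "realistic study" table. The variable names are illustrative, and the naive difference is computed from unrounded group means, so it can differ slightly from a figure built from rounded averages.

```python
from statistics import mean

# Potential outcomes from the "ideal study" table:
# subject -> (years lived as a smoker, years lived as a quitter)
ideal = {
    "A": (60, 71), "B": (72, 70), "C": (72, 84), "D": (71, 60),
    "E": (72, 75), "F": (52, 64), "G": (70, 80),
}

# In the "realistic study" table only A and G are observed as quitters.
quitters = {"A", "G"}

# True average effect: mean of Y_i(1) - Y_i(0). This needs BOTH columns.
true_ate = mean(smoke - quit_ for smoke, quit_ in ideal.values())

# What we can actually compute: observed smokers vs. observed quitters.
obs_smokers = [smoke for s, (smoke, _) in ideal.items() if s not in quitters]
obs_quitters = [quit_ for s, (_, quit_) in ideal.items() if s in quitters]
naive_diff = mean(obs_smokers) - mean(obs_quitters)

print(f"true average effect: {true_ate:.1f}")
print(f"naive difference:    {naive_diff:.1f}")
```

The two numbers differ because the naive comparison uses different people in each group, and nothing guarantees that the observed quitters resemble the observed smokers.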

Expected and conditional outcomes

Imagine we have some treatment \(D\) that reliably induces people to stop smoking.

Then we need to estimate the expected value (the mean) of the outcome for subject \(i\) conditional on being assigned treatment \(D\)

\[E[Y_i(1)|D_i = 1]\]

And also the expected value for those same subjects had they not been assigned to the treatment group. We can’t observe both of these simultaneously.

\[E[Y_i(0)|D_i = 1]\]

Expected and conditional outcomes

We want the expected years of life for individual \(i\) if they smoke compared to their expected years of life if they don’t:

\[E[Y_i(1) - Y_i(0)]\]

But we only have expected years of life for each group separately conditional on a treatment

\[E[Y_i(1)|D_i=1], E[Y_i(0)|D_i=0]\]

If we can assume the treatment assignment \(D_i\) is random and thus independent of (\(\unicode{x2AEB}\)) the potential outcomes and any other predictors of life expectancy \(X_i\)

\[Y_i(0), Y_i(1), X_i \mathrel{\unicode{x2AEB}} D_i\]

…then the conditional expectation is the same as the unconditional expectation and the effect is just the difference of means between each group (plus some random error)

\[E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=0] = E[Y_i(1) - Y_i(0)]\]
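A small simulation can make this concrete. The sketch below is illustrative only: the population, the "lifestyle" variable, the constant effect of -5 years, and both assignment rules are assumptions chosen for the example, not estimates from any study. It compares the difference in means under random assignment with the same difference when assignment depends on lifestyle.

```python
import random
from statistics import fmean

rng = random.Random(42)
n = 50_000

# Hypothetical population: "lifestyle" shifts baseline life expectancy.
lifestyle = [rng.gauss(0, 1) for _ in range(n)]
y0 = [75 + 4 * h + rng.gauss(0, 3) for h in lifestyle]  # outcome if untreated
y1 = [y - 5 for y in y0]                                # outcome if treated (assumed effect: -5)
true_ate = fmean(a - b for a, b in zip(y1, y0))

def diff_in_means(d):
    """Observed contrast: treated units reveal Y(1), control units reveal Y(0)."""
    treated = [y1[i] for i in range(n) if d[i] == 1]
    control = [y0[i] for i in range(n) if d[i] == 0]
    return fmean(treated) - fmean(control)

# (a) Random assignment: D is independent of lifestyle and the potential outcomes.
d_random = [rng.randint(0, 1) for _ in range(n)]

# (b) Confounded assignment: unhealthy lifestyles are more likely to be treated.
d_confounded = [1 if h + rng.gauss(0, 1) < 0 else 0 for h in lifestyle]

est_random = diff_in_means(d_random)
est_confounded = diff_in_means(d_confounded)

print(f"true ATE:             {true_ate:.2f}")
print(f"random assignment:    {est_random:.2f}")
print(f"confounded assignment:{est_confounded:.2f}")
```

Under random assignment the difference in means lands close to the true effect. Under the confounded rule it overstates the harm, because the treated group would have had shorter lives even without treatment.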

So what?

This is a roundabout way of saying that the problem of causal inference is solvable in the aggregate if and only if the “cause” is uncorrelated with any other characteristic that influences the outcome. If that condition holds, then a simple regression model or difference of means test can identify a causal relationship!

graph LR
A[Lifestyle]
B[Smoking]
C[Cancer]
  A-->B
  A-->C
  B-->C

Confounding because smoking is correlated with lifestyle

Notably, we don’t need to account for everything that impacts life expectancy here. We just need to ensure that those other predictors are not correlated with the treatment we’re interested in.

graph LR
A[Lifestyle]
B[Smoking]
C[Cancer]
  A-->C
  B-->C

No confounding (even though lifestyle still impacts cancer risk)

Three key assumptions for causal inference

  1. No confounding: we can’t observe the counterfactual for any individual, but we can infer an average counterfactual for groups provided we can assume that the treatment is independent of the potential outcomes and of any confounders: \(Y_i(0), Y_i(1), X_i \mathrel{\unicode{x2AEB}} D_i\)

  2. The excludability assumption requires that treatment assignment affects the outcome only through the treatment itself (so we need to rule out things like placebo effects)

  3. The non-interference assumption (aka the Stable Unit Treatment Value Assumption or SUTVA) assumes that treatment assignment for one unit doesn’t impact the outcomes of other units (it is violated if, for instance, people in the treatment group influence people in the control group)

Terminology: Treatment effects

  • Average Treatment Effect (ATE): the average of the individual effects, \(E[Y_i(1) - Y_i(0)]\)

  • Average Treatment Effect on the treated (ATT): \(E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=1]\)

  • Average Treatment Effect on the untreated (ATU): \(E[Y_i(1)|D_i=0] - E[Y_i(0)|D_i=0]\)

In the idealized scenario, these should be equivalent, but in practice they will likely diverge due to things like heterogeneous treatment effects, non-compliance, and imbalance between treatment and control units.
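To see how these three quantities can diverge, here is a minimal simulated sketch. Everything in it is assumed for illustration: effects vary across units, and units with larger effects are more likely to end up treated. Because the simulation generates both potential outcomes, all three estimands can be computed directly, which is never possible with real data.

```python
import random
from statistics import fmean

rng = random.Random(7)
n = 20_000

y0 = [rng.gauss(50, 10) for _ in range(n)]       # untreated potential outcome
effect = [rng.gauss(2, 3) for _ in range(n)]     # heterogeneous individual effects
y1 = [a + e for a, e in zip(y0, effect)]         # treated potential outcome

# Assumed selection rule: units that benefit more are more likely to be treated.
d = [1 if e + rng.gauss(0, 3) > 2 else 0 for e in effect]

ate = fmean(b - a for a, b in zip(y0, y1))
att = fmean(b - a for a, b, di in zip(y0, y1, d) if di == 1)
atu = fmean(b - a for a, b, di in zip(y0, y1, d) if di == 0)
share_treated = fmean(d)

print(f"ATE: {ate:.2f}  ATT: {att:.2f}  ATU: {atu:.2f}")
# The ATE is the share-weighted average of the ATT and the ATU.
print(f"weighted: {share_treated * att + (1 - share_treated) * atu:.2f}")
```

With selection on gains the ATT exceeds the ATE, which in turn exceeds the ATU. If the treatment indicator were assigned at random instead, the three would agree up to sampling noise.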

Experimental Research

Experiments are the most straightforward way to satisfy the “no confounding” assumption, despite their limitations.

Example: immigration attitudes

Sides and Citrin (2007): people who overestimate immigration numbers tend to have more negative attitudes towards immigration. But is this a causal relationship?

Sides, J., & Citrin, J. (2007). European opinion about immigration: The role of identities, interests and information. British journal of political science, 37(3), 477-504.

Example: immigration attitudes

We could have multiple kinds of confounding here, including the possibility that anti-immigration attitudes impact misperceptions of immigrant numbers (simultaneous causation)

graph LR
A[low information etc.]
B[overestimating immigrant numbers]
C[anti-immigrant attitudes]
A==>B
A<==>C
B-->C

confounding

Example: immigration attitudes

With confounding, the difference between “overestimators” compared to people with accurate perceptions:

\[\underbrace{E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{the difference between overestimators and non-overestimators}\]

is actually a combination of the actual effect and the effect of confounding:

\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{actual effect of overestimating} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{the effect of other stuff correlated with overestimation}\]

We may not even know what all of the “other stuff” is, so including more controls may not solve this.
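The decomposition follows from adding and subtracting the unobserved counterfactual mean \(E[Y_i(0)|D_i=1]\), using the same notation as above:

\[\begin{aligned}
E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=0] &= E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=1] + E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0] \\
&= E[Y_i(1) - Y_i(0)|D_i=1] + \left(E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]\right)
\end{aligned}\]

The first term is the effect among the treated. The second term is zero when \(D_i\) is independent of \(Y_i(0)\), which is what random assignment buys us.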

Example: immigration attitudes

Hopkins, Sides and Citrin (2018): The Muted Consequences of Correct Information about Immigration

Question: People consistently overestimate immigrant populations. Does giving them correct information about immigration levels influence their attitudes?

Method: a survey experiment. Randomly assign some survey respondents to receive correct information about immigration levels before asking them their views.

Effects of the different experimental treatments on support for increasing or decreasing levels of legal immigration (left) and on the index of anti-immigration attitudes (right). Dots indicate mean levels, and the horizontal lines are 95% confidence intervals. The vertical lines at the bottom present the jittered distribution of the dependent variable in 2010. CCES = Cooperative Congressional Election Survey; KN = Knowledge Network.

Left: effect of different random treatments on support for increasing vs. decreasing immigration. Right: effects on anti-immigration attitudes.

Non-experimental research

Observational research still has this same basic problem, even if we don’t talk about “treatment” in the same way:

Hypothesis: Fox News viewers are less likely to get the Covid vaccine.

Issue: the expected untreated outcome \(E[Y_i(0)]\) for Fox News viewers is not the same as \(E[Y_i(0)]\) for non-Fox viewers. The “treatment” and “control” groups are different for all sorts of reasons. So

\[\underbrace{E[Y_i(1)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{difference between viewers and non-viewers}\]

is now a combination of:

\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{treatment effect for Fox News viewers} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{effect of predisposition to watch Fox News}\]

Moving to observational research

Or the role of racial bias in motivating traffic stops

\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{racial discrimination effect} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{non-racial profiling and actual rate of minor traffic violations}\]

Moving to observational research

Or fears of being drafted on war attitudes

\[\underbrace{E[Y_i(1) - Y_i(0)|D_i=1]}_\text{fear of being drafted} + \underbrace{E[Y_i(0)|D_i=1] - E[Y_i(0)|D_i=0]}_\text{pre-existing attitudes about war}\]

Why not just include control variables?

Typically, we try to account for this sort of thing using control variables. But cramming stuff in can also introduce problems:

  • All confounders must be measured and accounted for
  • All non-linear/interactive confounding must be modeled as well.
  • Multicollinearity means that we have a strict limit on the number of control variables (and means we have diminishing returns well before that)
  • The inclusion of bad controls (colliders) can actually make bias worse rather than better.

To get a better sense of this, we need to take a brief detour into DAGs

DAGs and collision

Directed Acyclic Graphs: describe relevant causal relationships between an IV and a DV of interest.

  • Nodes are variables: Z, X, and Y

  • Arrows (aka edges) indicate causal relationships so \(X\rightarrow Y\), \(Z \rightarrow Y\) and \(Z \rightarrow X\)

  • A path is anything you can draw to connect two nodes. So \(X \rightarrow Y\) and \(X \leftarrow Z \rightarrow Y\) are both paths that could lead from X to Y

  • Our goal is to leave open only the causal paths running from X to Y by “closing all open backdoor paths”, usually by conditioning on the variables along them. So here, we want to control for Z to close \(X \leftarrow Z \rightarrow Y\)

graph LR
Z[Z]
X[X]
Y[Y]
  Z-->X
  Z-->Y
  X-->Y

DAGs

  • Including a control variable in a regression is one of several ways to condition, but it requires us to measure Z and include it in the model.

  • Unobserved confounding (sometimes represented using dashed lines) can be addressed by randomization, but not by regression

graph LR
Z[Z]
X[X]
Y[Y]
  Z-->X
  Z-->Y
  X-->Y
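As a rough illustration of what "closing" the backdoor path does, the sketch below simulates the DAG above with a binary Z and a binary X. The effect size, the probabilities, and the variable names are all assumptions made for the example. Conditioning is done by stratifying on Z rather than by regression, which is one of the several ways to condition.

```python
import random
from statistics import fmean

rng = random.Random(1)
n = 100_000
true_effect = 2.0  # assumed causal effect of X on Y

rows = []
for _ in range(n):
    z = rng.random() < 0.5                            # binary confounder Z
    x = rng.random() < (0.8 if z else 0.2)            # Z -> X
    y = true_effect * x + 5.0 * z + rng.gauss(0, 1)   # X -> Y and Z -> Y
    rows.append((z, x, y))

def diff(subset):
    """Mean of Y among X=1 minus mean of Y among X=0."""
    return (fmean(y for _, x, y in subset if x)
            - fmean(y for _, x, y in subset if not x))

# Backdoor path X <- Z -> Y is open: the raw comparison mixes in Z's effect.
naive = diff(rows)

# Condition on Z: compare within each stratum, then weight strata by size.
strata = [[r for r in rows if r[0] == level] for level in (False, True)]
adjusted = sum(len(s) / n * diff(s) for s in strata)

print(f"naive: {naive:.2f}  adjusted: {adjusted:.2f}  truth: {true_effect:.2f}")
```

The adjustment only works because Z is measured. If Z were unobserved there would be nothing to stratify on, which is the case where randomization helps and regression does not.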

DAGs

Collision happens when two variables on a path between X and Y both point to the same node.

So Z is a collider here: \(X \rightarrow Z \leftarrow Y\)

graph LR
Z[Z]
X[X]
Y[Y]
  X-->Z
  X-->Y
  Y-->Z

DAGs

Collider paths are “closed” on their own. So, unlike the confounding case, conditioning on the collider opens the path and causes bias rather than reducing it.

graph LR
Z[Z]
X[X]
Y[Y]
  X-->Z
  X-->Y
  Y-->Z
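The same kind of simulation shows the opposite result for a collider. In this sketch (again with assumed effect sizes and an arbitrary threshold for Z), the raw comparison of Y across X is already unbiased, and stratifying on Z makes it worse.

```python
import random
from statistics import fmean

rng = random.Random(3)
n = 100_000
true_effect = 1.0  # assumed causal effect of X on Y

rows = []
for _ in range(n):
    x = rng.random() < 0.5
    y = true_effect * x + rng.gauss(0, 1)      # X -> Y
    z = (x + y + rng.gauss(0, 1)) > 1.5        # X -> Z <- Y (collider)
    rows.append((z, x, y))

def diff(subset):
    """Mean of Y among X=1 minus mean of Y among X=0."""
    return (fmean(y for _, x, y in subset if x)
            - fmean(y for _, x, y in subset if not x))

# No backdoor path exists, so the raw comparison recovers the effect.
unadjusted = diff(rows)

# "Controlling" for the collider: compare within strata of Z.
strata = [[r for r in rows if r[0] == level] for level in (False, True)]
conditioned = sum(len(s) / n * diff(s) for s in strata)

print(f"unadjusted: {unadjusted:.2f}  conditioned on Z: {conditioned:.2f}")
```

Within a stratum of Z, knowing X tells you something about Y beyond the causal effect: among units with Z = 1, those with X = 0 needed an unusually high Y to get there. That induced association is the bias.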

Collider bias

Collider bias is really a more general version of the selection bias problem. Recall that selection bias occurs when we only see outcomes above a certain threshold. For instance: when low expectations of earnings cause people to drop out of the sample.

Since these outcomes are unobserved, models looking at this group are stratified based on the collider

graph LR
X[education]
Y[wages]
Z[labor force participation]
  X-->Z
  X-->Y
  Y-->Z

Collider bias

The selection problem frames this as a function of a failure to consider a group, but it could also be thought of as a form of conditioning on potential outcomes: the probability of seeing certain observations depends on \(E[Y_i(1)]\).

                     true model     without collider   with collider
(Intercept)          9.267          68.450 ***         -5.578
                     (6.063)        (5.962)            (4.686)
educ_year            9.453 ***      6.594 ***          6.028 ***
                     (0.390)        (0.367)            (0.326)
lfpin labor force                                      82.980 ***
                                                       (3.141)
N                    1000           800                1000
R2                   0.370          0.288              0.630
logLik               -5279.006      -4043.012          -5013.666
AIC                  10564.013      8092.024           10035.333
*** p < 0.001; ** p < 0.01; * p < 0.05.
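The attenuation of the education coefficient in a selected sample can be reproduced in spirit with simulated data. The sketch below does not use the data behind the table: the wage equation, the noise levels, and the labor-force rule are all assumptions, and it covers only the selected-sample comparison, not the model that adds the collider as a control.

```python
import random
from statistics import fmean

rng = random.Random(11)
n = 50_000
true_slope = 9.0  # assumed wage return per year of education

def slope(xs, ys):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    mx, my = fmean(xs), fmean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

educ = [rng.uniform(8, 20) for _ in range(n)]
wage = [10 + true_slope * e + rng.gauss(0, 40) for e in educ]

# Hypothetical selection rule: people with low wages tend to leave the labor force.
in_lf = [w + rng.gauss(0, 20) > 100 for w in wage]

full = slope(educ, wage)
selected = slope([e for e, s in zip(educ, in_lf) if s],
                 [w for w, s in zip(wage, in_lf) if s])

print(f"full sample slope:     {full:.2f}")
print(f"selected sample slope: {selected:.2f}")
```

Low-education people who stay in the sample are disproportionately those with unusually high wages, which flattens the estimated relationship.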

Collider bias

Selection problems can cause biased estimates. Including a control for a collider has a similar impact.

This might seem like an odd thing to do, but it actually comes up a lot!

For instance, in discussions of discrimination, people will advocate for examining wage disparities after controlling for things (like job title) that are themselves downstream from discrimination and upstream of wages.

Collider bias

  • Imagine a scenario where a company only promotes women to management roles if they are in the top 10% for ability, but promotes men if they’re in the top 50%.

graph LR
X[Discrimination]
A[Ability]
U[Title]
Y[Earnings]
  X-->U
  X-->Y
  A-->U
  A-->Y
  U-->Y

Collider bias

All paths from discrimination to earnings:

  • Discrimination \(\rightarrow\) Earnings (direct effect)

  • Discrimination \(\rightarrow\) Job title \(\rightarrow\) Earnings (mediated effect)

  • Discrimination \(\rightarrow\) Job title \(\leftarrow\) Ability \(\rightarrow\) Earnings (collider)

graph LR
X[Discrimination]
A[Ability]
U[Title]
Y[Earnings]
  X-->U
  X-->Y
  A-->U
  A-->Y
  U-->Y

Collider bias

What happens if we dis-aggregate by position and then compare wages? Why?

Collider bias

One way to think about this is in terms of potential outcomes: the women who get promoted, if not for discrimination, would already have a higher expected wage than the men. So when you stratify on job title, you end up comparing women who would have high wages if not for discrimination to men who would have lower wages if not for discrimination.
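A simulated version of the promotion scenario shows how stratifying on job title can hide or even reverse the wage gap. The wage equation, the size of the direct penalty, and the manager premium below are assumptions chosen for illustration. Only the promotion rule (top 10% of ability for women, top 50% for men) comes from the scenario described earlier.

```python
import random
from statistics import NormalDist, fmean

rng = random.Random(5)
n = 200_000
penalty = -5.0  # assumed direct wage effect of discrimination against women

cut_women = NormalDist().inv_cdf(0.90)  # women need top-10% ability
cut_men = NormalDist().inv_cdf(0.50)    # men need top-50% ability

people = []
for _ in range(n):
    woman = rng.random() < 0.5
    ability = rng.gauss(0, 1)
    manager = ability > (cut_women if woman else cut_men)
    wage = (50 + 10 * ability + 20 * manager
            + penalty * woman + rng.gauss(0, 5))
    people.append((woman, manager, wage))

def gap(group):
    """Mean wage of women minus mean wage of men within a group."""
    return (fmean(w for f, _, w in group if f)
            - fmean(w for f, _, w in group if not f))

overall_gap = gap(people)
manager_gap = gap([p for p in people if p[1]])
staff_gap = gap([p for p in people if not p[1]])

print(f"overall gap:          {overall_gap:.2f}")
print(f"gap among managers:   {manager_gap:.2f}")
print(f"gap among non-managers: {staff_gap:.2f}")
```

In this setup women earn less overall, yet within each job title they appear to earn more than men, because the women in each title have higher average ability than the men they are compared with. A within-title comparison would wrongly suggest there is no discrimination.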

Considerations

  • For colliders \(\rightarrow Z \leftarrow\) the best approach is actually to do nothing. They’re already “closed”

    • Collision is one reason to be skeptical of “garbage can” regression models that try to account for everything.
  • On the other hand, we do need to control for confounding \(\leftarrow Z \rightarrow\), but this assumes we can measure it.

Quasi-experimental methods

  • Randomization allows for true causal inference, but it’s often not an option

  • Regression can do this, but only under very strict conditions and there’s a real risk of garbage can models making things worse.

  • Quasi-experimental methods look for ways to ensure that treatment is independent of potential outcomes, or, barring that, that potential outcomes are balanced between treated and control units.